
\section {Results} \label{results}

In this section we describe the results of our experiment, which we classify as objective or subjective. Objective results were obtained from data on the experimental subjects' behavior, automatically extracted using the GIVE platform; we describe them in Section~\ref{objective-results}. Subjective results were obtained through questionnaires given to the subjects after the experiment was completed; we describe them in Section~\ref{subjective-results}.

\subsection {Objective metrics} \label{objective-results}
%\emph{
%First: design of the GIVE software and logging, two conditions \\
%Second: Objective measures for MR vs OR \\ 
%Third: Average success rate \\
%Fourth: Average speed \\ }

%\fxnote{FL: interesting, perhaps a bit short but I know it is hard to do. Do not forget to mention which statistical tests are used (even if it is the simplest of them)}
%\fxnote{SV: Which tests did we use apart from statistical significance? Should we do some other ones?} 

The experiment was designed so that the GIVE platform logged the position of the subject and of all visible objects every 200 ms. Furthermore, the reference given, the corresponding referent, and whether the subject pressed the referent button were logged each time a button was pressed. From these logs we extracted the \emph{number of errors} committed by the subjects while identifying the referents, that is, the number of REs that were not followed by the manipulation of the corresponding referent. The subject had 60 seconds to identify the referent; if this time passed and no button had been manipulated, we counted this as an error.

In order to test our hypothesis, we extracted information on \emph{whether} and \emph{how much} the number of errors decreased between the First Test Phase and the Second Test Phase. That is, we measured whether the Exercise Phase (which came between the First Test Phase and the Second Test Phase) helped the subjects remember the words better. These metrics give us evidence of whether practicing new vocabulary with overspecified references is more effective than practicing with minimal ones.

Table~\ref{fig2} shows the percentage of subjects who decreased their number of errors in each condition. We call this metric the \emph{lexeme acquisition rate}. Of the 18 subjects in the OR condition who could improve (7 of the 25 had a perfect score in the first test), 16 decreased their number of errors, that is, 89\% of the subjects improved in the OR condition. However, of the 19 subjects who could improve in the MR condition, only 9 did; this represents only 56\% of the subjects. Table~\ref{fig2} also indicates the percentage of improvement on the different kinds of properties included in the referring expressions. The REs in the test phases included taxonomical properties (e.g.~\emph{chair}) and absolute properties (e.g.~\emph{blue}). The subjects in the OR condition improved on both kinds of properties, but their performance on taxonomical properties was better. Taxonomical properties are realized as nouns, while absolute properties are realized as adjectives. Our results are thus consistent with previous work showing that nouns are acquired before adjectives~\cite{genter82}.

\begin{table}[h!]
\begin{center}
\begin{minipage}{8cm}
    \begin{tabular}{ | p{4.5cm} | p{1cm} | p{1cm} |}
    \hline
    Lexeme acquisition rate & MR & OR \\ \hline
    Any lexeme\footnote{Note that these values are not the sum of the other two rows, since many subjects improved on both kinds of properties.} (\%) & 56 & 89 \\ \hline
    Taxonomical lexemes (\%) & 53 & 100 \\ \hline
    Absolute lexemes (\%) & 45 & 87.5 \\ \hline
    \end{tabular}
\end{minipage}
\end{center}
\caption{\label{fig2} Percentage of subjects who decreased their number of errors between the First and Second Test Phases, by type of lexeme.}
\end{table}

The lexeme acquisition rate shows \emph{whether} the subjects improved their performance in the Second Test Phase. Below we discuss \emph{how much} the subjects improved in each condition.

We summed the errors of all participants in the First Test and in the Second Test for both the MR and the OR conditions. We show these results in Table~\ref{table-error}. The table also shows the difference between the number of errors in the two phases; the \emph{error overcoming} measures what percentage of the First Test errors were overcome in the Second Test Phase. The results show that a larger percentage of errors was overcome in the OR condition (43\%) than in the MR condition (29\%).
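
As a worked check of this metric, the error overcoming can be read as the relative reduction in errors between the two test phases. The sketch below assumes that the percentage is taken relative to the First Test errors, which matches the reported values; the symbols $E_1$ and $E_2$ (total errors in the First and Second Test Phases) are introduced here only for illustration:
\[
\mathit{error\ overcoming} = \frac{E_1 - E_2}{E_1} \times 100
\]
With the values in Table~\ref{table-error}, this gives
\[
\mathrm{MR:}\ \frac{55 - 39}{55} \times 100 = \frac{16}{55} \times 100 \approx 29\%,
\qquad
\mathrm{OR:}\ \frac{65 - 37}{65} \times 100 = \frac{28}{65} \times 100 \approx 43\%.
\]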

\begin{table}[h!]
\begin{center}
    \begin{tabular}{ | p{4cm} | p{1cm} | p{1cm} |}
    \hline
    Number of errors & MR & OR \\ \hline
    %Test Referring Expressions & 175 & 175 \\ \hline
    First test errors & 55 & 65 \\ \hline
    Second test errors & 39 & 37 \\ \hline
    Difference between tests & 16 & 28 \\ \hline
    Error overcoming (\%) & 29 & 43 \\\hline
    \end{tabular}
\end{center}
\caption{\label{table-error} Number of errors committed in the First and Second Test Phases, the difference between the phases, and the percentage of errors overcome.}
\end{table}


% ergo, we calculated not only the average success rate for all objects, but also the specific success rate for each type of reference: Taxonomical (only the names of objects), Absolute (only colors), Minimal Reference, and Overspecified Reference. Average success was highest for Taxonomical references. The success rate for the OR condition was nearly 9% higher than for the MR condition.

In reference resolution experiments, the reaction time is usually measured in order to detect hesitations during the resolution process, between the moment in which the referring expression is uttered and the moment when there is evidence of identification. Since our subjects can move freely in the virtual world, the distance from the subject to the referent influences the resolution time; if the subject is further away, it will take them longer to manipulate the referent. In order to normalize our results, factoring out the distance, we use \emph{resolution speed} instead of resolution time. We then calculated the average speed at which the subjects in each condition resolved the referring expressions in the Exercise Phase. We only calculated this metric for the Exercise Phase because the REs received in the other phases were all minimal. We also calculated the average success rate achieved by each group during the Exercise Phase. Table~\ref{fig3} presents the average success rates and speeds per group.
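
A minimal sketch of this normalization, assuming that resolution speed is the distance between the subject and the referent when the RE is given, divided by the time taken to manipulate the referent (the symbols $v$, $d$, and $t$ are introduced here for illustration and are not names used in the logs):
\[
v = \frac{d}{t}
\]
Under this assumption, and with hypothetical values, a referent 500 cm away that is manipulated after 5 s and a referent 1000 cm away that is manipulated after 10 s both yield a resolution speed of 100 cm/s, so the two resolutions are treated as equally fast despite the different distances.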

\begin{table}[h!]
\begin{minipage}{8cm}
\begin{center}
    \begin{tabular}{ | p{4cm} | p{1cm} | p{1cm} |}
    \hline
    Exercise phase performance & MR & OR \\ \hline
    Success rate (\%) & 58.00 & 66.67 \\ \hline
    Resolution speed (cm/s)\footnote{The metric unit used for speed is an interpretation of perceived size in the virtual world.} & 101.1 & 49.88\\ \hline
    \end{tabular}
\end{center}
\end{minipage}
\caption{\label{fig3} Average resolution speed and success rate for each condition in the Exercise Phase. All differences between the OR and MR conditions are significant at $p < 0.01$ (paired t-tests).}
\end{table}

While the success rate for overspecified referring expressions (in the OR condition) is significantly higher, the speed at which the referring expressions were resolved is significantly lower. These results are consistent with previous work~\cite{Engelhardt_Bar_11} reporting that it takes more time to resolve overspecified referring expressions.

We also collected demographic data via a pre-questionnaire, namely age, gender, number of languages spoken, familiarity with 3D video games, and profession. None of these factors had a significant effect on lexeme acquisition rate or number of errors. However, we did find a significant effect of age and video game familiarity ($p=0.00001$ and $p=0.001$, respectively) on speed: younger gamers were faster. Our OR and MR groups were balanced with respect to age and video game familiarity.

\subsection {Subjective metrics} \label{subjective-results}
%\emph{
%First: explanation of the questionnaire \\ 
%Second: measures, correlations. \\}

After the experiment, subjects were asked to complete a questionnaire collecting subjective ratings of various aspects of the experiment. There were four subjective metrics in the post-questionnaire. Table~\ref{fig4} shows the mean of the ratings given by the subjects, with the standard deviation in parentheses.


\begin{table}[h!]
\begin{center}
    \begin{tabular}{ | p{3.6cm} | p{1.5cm} | p{1.5cm} |}
    \hline
  Metric Evaluated & MR (\%) & OR (\%) \\ \hline
  Q1) The instructions in Spanish were easy to understand  & 95.1 (9) & 94.2 (11) \\ \hline
  Q2) The descriptions in Russian were visible long enough for me to read them  & 90.2 (13) & 89.8 (14) \\ \hline
  Q3) The descriptions of objects in Russian were good descriptions  & 80.9 (15) & 79.4 (12) \\ \hline
  Q4) The exercise room helped me remember the Russian words better & \textbf{75.1} (21) & \textbf{85.8} (17) \\ \hline
    \end{tabular}
\end{center}
\caption{\label{fig4} Subjective metrics collected in the post-questionnaire. The differences between the OR and MR conditions are significant at $p < 0.01$ (paired t-tests) only for the values in bold.}
\end{table}

Metrics Q1 and Q2 verify that there were no technical problems with the experiment that could have affected the results. Q3 aimed to reproduce previous experiments showing that people do not rate overspecified expressions worse even though it takes them longer to resolve them~\cite{Engelhardt_Bailey_Ferreira_2006}. Q4 was the subjective metric that aimed to test our hypothesis: it evaluated whether the subjects in the OR condition perceived the Exercise Phase as more useful for acquiring the lexemes than the subjects in the MR condition did. We found that this was the case, since there is a significant difference between the two groups.

In sum, we found not only that the improvement between the First and Second Test Phases was greater for subjects in the OR condition, but also that these subjects perceived the Exercise Phase as more useful. We discuss these results in the next section.
